<html>
<head><meta charset="utf-8"><title>crater issues · t-infra · Zulip Chat Archive</title></head>
<h2>Stream: <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/index.html">t-infra</a></h2>
<h3>Topic: <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html">crater issues</a></h3>

<hr>

<base href="https://rust-lang.zulipchat.com">

<head><link href="https://rust-lang.github.io/zulip_archive/style.css" rel="stylesheet"></head>

<a name="209851525"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/209851525" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#209851525">(Sep 12 2020 at 00:58)</a>:</h4>
<p>A Crater job just failed, and three agents are marked as 'Online' instead of 'Working': <a href="https://crater.rust-lang.org/ex/pr-76219">https://crater.rust-lang.org/ex/pr-76219</a></p>



<a name="209908766"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/209908766" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#209908766">(Sep 13 2020 at 01:41)</a>:</h4>
<p>One agent is currently marked 'Online' instead of 'Working'</p>



<a name="231150327"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231150327" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231150327">(Mar 20 2021 at 14:42)</a>:</h4>
<p>They are marked as 'Online' and 'Unreachable' respectively</p>



<a name="231213500"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231213500" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Mara <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231213500">(Mar 21 2021 at 12:27)</a>:</h4>
<p>The <a href="https://crater.rust-lang.org/">crater queue</a> shows <a href="https://github.com/rust-lang/rust/issues/82565">#82565</a> as 'running', but it also already continued with the second one in the queue. It should've already been finished. Any idea what happened here?</p>
<p><a href="https://crater.rust-lang.org/ex/pr-82565">https://crater.rust-lang.org/ex/pr-82565</a></p>
<blockquote>
<p>Estimated end:    22 minutes</p>
</blockquote>



<a name="231245558"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231245558" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Noah Lev <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231245558">(Mar 21 2021 at 22:30)</a>:</h4>
<p>Hmm, crater website is down for me...</p>



<a name="231245615"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231245615" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Noah Lev <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231245615">(Mar 21 2021 at 22:31)</a>:</h4>
<p>nvm, it's just loading really slowly</p>



<a name="231356936"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231356936" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Mara <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231356936">(Mar 22 2021 at 18:03)</a>:</h4>
<p>oh looks like the other PR it is 'running' is also stuck. that one just permanently displays <code>76%</code> it seems.</p>



<a name="231485119"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231485119" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231485119">(Mar 23 2021 at 15:03)</a>:</h4>
<p>all of this was due to an underlying problem with the crater agents on gcp</p>



<a name="231485169"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231485169" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231485169">(Mar 23 2021 at 15:03)</a>:</h4>
<p><span class="user-mention" data-user-id="116122">@simulacrum</span> restarted them, so hopefully everything should start working again</p>



<a name="231690098"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231690098" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231690098">(Mar 24 2021 at 19:00)</a>:</h4>
<p><a href="https://crater.rust-lang.org/">https://crater.rust-lang.org/</a> is now giving a 502</p>



<a name="231691524"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231691524" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231691524">(Mar 24 2021 at 19:09)</a>:</h4>
<p>gah the server oom'd</p>



<a name="231691540"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/231691540" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#231691540">(Mar 24 2021 at 19:09)</a>:</h4>
<p>(also, I'm wondering why our alerting for crater stopped working...)</p>



<a name="232084258"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/232084258" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Mara <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#232084258">(Mar 27 2021 at 11:38)</a>:</h4>
<p>crater is stuck again: <a href="https://crater.rust-lang.org/ex/pr-82781">https://crater.rust-lang.org/ex/pr-82781</a></p>



<a name="232754841"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/232754841" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Dirkjan Ochtman <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#232754841">(Apr 01 2021 at 13:21)</a>:</h4>
<p>Any updates? Would like to have a look at those results :)</p>



<a name="232940588"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/232940588" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#232940588">(Apr 02 2021 at 19:21)</a>:</h4>
<p>None of the agents are running except for <code>azure-1</code></p>



<a name="232978154"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/232978154" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#232978154">(Apr 03 2021 at 04:00)</a>:</h4>
<p>It seems like there have been many more Crater issues than usual recently - is there a common cause to any of them? At this rate, it's going to take over a month to get through the queue, which is going to hold up many PRs</p>



<a name="232979237"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/232979237" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#232979237">(Apr 03 2021 at 04:20)</a>:</h4>
<p>I'm planning to try to fit some time in to investigate this problem this weekend, IIRC there's the normal problem of crater machines getting into a deadlock, but it's recently been exacerbated by our alerting going down for unknown reasons.</p>



<a name="233126612"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233126612" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233126612">(Apr 05 2021 at 01:27)</a>:</h4>
<p>(as may have been expected I did not find the time. Maybe this week)</p>



<a name="233199990"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233199990" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233199990">(Apr 05 2021 at 17:06)</a>:</h4>
<p><a href="https://crater.rust-lang.org/">https://crater.rust-lang.org/</a> is now showing a 502 error</p>



<a name="233204115"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233204115" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233204115">(Apr 05 2021 at 17:39)</a>:</h4>
<p>hm, investigating this</p>



<a name="233204949"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233204949" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233204949">(Apr 05 2021 at 17:46)</a>:</h4>
<p>ok, not sure why it was down, but crater.r-l.o is back up</p>



<a name="233205487"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233205487" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233205487">(Apr 05 2021 at 17:51)</a>:</h4>
<p>Two jobs are now marked as 'Generating report' - I didn't think that was possible</p>



<a name="233205980"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233205980" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233205980">(Apr 05 2021 at 17:54)</a>:</h4>
<p>Trying to look into those</p>



<a name="233207655"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233207655" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233207655">(Apr 05 2021 at 18:06)</a>:</h4>
<p>seems like the retry-report should make this work</p>



<a name="233207684"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233207684" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233207684">(Apr 05 2021 at 18:06)</a>:</h4>
<p>though I have a suspected cause - looks like writing out full.html used roughly 80% of memory on the machine</p>



<a name="233209609"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233209609" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233209609">(Apr 05 2021 at 18:22)</a>:</h4>
<p>ok, stopped all the agents</p>



<a name="233209677"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233209677" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233209677">(Apr 05 2021 at 18:22)</a>:</h4>
<p>my current belief is that the problem we have is this:</p>
<ul>
<li>generating a report is done on the same thread as everything else, and essentially doesn't yield</li>
</ul>



<a name="233209715"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233209715" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233209715">(Apr 05 2021 at 18:23)</a>:</h4>
<p>and this particular server only has one cpu, so the webserver and that thread are the same thread</p>
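<p>A minimal sketch of the hypothesis above, written with Python's <code>asyncio</code> rather than tokio, so it is an analogy and not crater's actual code. <code>heartbeat</code> stands in for the webserver and <code>blocking_report</code> for report generation; both names are hypothetical. It only shows that work which never yields starves other tasks sharing the same event-loop thread. Later messages in this topic find that the server does have several threads, so treat this as an illustration of the theory and not a confirmed diagnosis.</p>

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    # Stand-in for the webserver: records a timestamp each time the loop runs it.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


def blocking_report(duration):
    # Stand-in for report generation: busy work that never yields to the loop.
    end = time.monotonic() + duration
    while time.monotonic() < end:
        pass


async def longest_stall(block_for=0.2):
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)      # heartbeat ticks normally here
    blocking_report(block_for)     # nothing else can run during this call
    await asyncio.sleep(0.05)      # heartbeat resumes
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # Largest gap between two consecutive heartbeat ticks.
    return max(b - a for a, b in zip(ticks, ticks[1:]))


max_gap = asyncio.run(longest_stall())
print(f"longest heartbeat stall: {max_gap:.3f}s")
```

<p>The longest stall is at least as long as the blocking call, which is the behaviour the single-thread theory would predict for incoming requests.</p>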



<a name="233209740"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233209740" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233209740">(Apr 05 2021 at 18:23)</a>:</h4>
<p>i'm going to see if that is accurate (i.e. that the webserver runs in the same thread)</p>



<a name="233209757"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233209757" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233209757">(Apr 05 2021 at 18:23)</a>:</h4>
<p>it's possible that sqlite locks are the actual problem</p>



<a name="233210277"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233210277" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233210277">(Apr 05 2021 at 18:27)</a>:</h4>
<p>aha, ok, so we do in fact have several threads</p>



<a name="233210909"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233210909" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233210909">(Apr 05 2021 at 18:32)</a>:</h4>
<p>I am struggling on this relatively old environment to get a concise listing, but it seems like at least there's two tokio-runtime threads, one of which is using ~100% cpu</p>



<a name="233212994"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233212994" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233212994">(Apr 05 2021 at 18:50)</a>:</h4>
<p>hm, it <em>seems</em> like for whatever reason the usage of that one tokio thread is causing requests to the server to time out, which seems unexpected</p>



<a name="233213242"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233213242" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233213242">(Apr 05 2021 at 18:52)</a>:</h4>
<p>unfortunately we don't have tokio logs at all</p>



<a name="233213274"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233213274" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233213274">(Apr 05 2021 at 18:52)</a>:</h4>
<p>I'm not sure if there's something we can do to inspect state at runtime, I don't really want to restart this log pushing</p>



<a name="233214072"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233214072" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233214072">(Apr 05 2021 at 18:58)</a>:</h4>
<p>it does look like we're doing a ton of disk i/o (lseek + read syscalls)</p>



<a name="233215439"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233215439" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233215439">(Apr 05 2021 at 19:08)</a>:</h4>
<p>it looks like we're writing ~15k crate logs in ~20 minutes, which is roughly 12.5 logs / second... that feels maybe slow?</p>
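<p>The rate quoted above can be checked with a couple of lines of Python. The helper name <code>logs_per_second</code> is made up for this sketch, and the inputs are the approximate figures from the message.</p>

```python
def logs_per_second(log_count, minutes):
    # Convert a log count over a span of minutes into a per-second rate.
    return log_count / (minutes * 60)


# Approximate figures from the message: ~15k logs in ~20 minutes.
rate = logs_per_second(15_000, 20)
print(rate)
```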



<a name="233215511"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233215511" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233215511">(Apr 05 2021 at 19:09)</a>:</h4>
<p>perf top seems to point at ~20% being in sqlite exec and ~20% in kernel copy_user_enhanced_fast_string (plus 10% read/llseek)</p>



<a name="233216769"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233216769" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233216769">(Apr 05 2021 at 19:20)</a>:</h4>
<p>so I'll let that run, because it doesn't look like it's feasible to concurrently run that with a crater build, even though I can't fully tell why we're not serving requests nicely on the other thread</p>



<a name="233216800"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233216800" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233216800">(Apr 05 2021 at 19:20)</a>:</h4>
<p>needs another hour or so</p>



<a name="233217700"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233217700" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233217700">(Apr 05 2021 at 19:28)</a>:</h4>
<p>I'm wondering if this might be a symptom of using tokio 0.1 still</p>



<a name="233217768"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233217768" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233217768">(Apr 05 2021 at 19:29)</a>:</h4>
<p>but in any case, not diving into updating that now</p>



<a name="233217784"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233217784" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233217784">(Apr 05 2021 at 19:29)</a>:</h4>
<p>FWIW <a href="http://docs.rs">docs.rs</a> has had issues with FD leaks in the past, I would be surprised if those fixes were backported</p>



<a name="233217801"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233217801" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Joshua Nelson <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233217801">(Apr 05 2021 at 19:29)</a>:</h4>
<p>but it sounds like this is CPU related, not a resource leak</p>



<a name="233217813"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233217813" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233217813">(Apr 05 2021 at 19:29)</a>:</h4>
<p>I don't think that's at all related to the problems we're seeing here</p>



<a name="233218360"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233218360" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233218360">(Apr 05 2021 at 19:33)</a>:</h4>
<p>I'm going to try enabling logs before kicking off the next report</p>



<a name="233221952"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233221952" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233221952">(Apr 05 2021 at 20:03)</a>:</h4>
<p><span class="user-mention" data-user-id="121055">@Pietro Albini</span> can you comment on your availability for reviews etc on crater? It feels important to me to get it to a point where we don't need to baby sit it as much, and depending on if you're available I can see us doing that via code improvement (e.g., trying to upgrade tokio, which likely has some code changes required) or by resizing the machine things run on, if not</p>



<a name="233221978"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233221978" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233221978">(Apr 05 2021 at 20:03)</a>:</h4>
<p>I'm also happy to own reviewing and merging things, but don't know how comfortable you are with that</p>



<a name="233235490"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233235490" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233235490">(Apr 05 2021 at 21:42)</a>:</h4>
<p>and.... it seems like the majority of the slowness came from metrics queries?</p>



<a name="233235506"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233235506" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233235506">(Apr 05 2021 at 21:42)</a>:</h4>
<p>or that's my operating theory, which is downright weird</p>



<a name="233235559"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233235559" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233235559">(Apr 05 2021 at 21:43)</a>:</h4>
<p>I'm going to disable those temporarily and see if that helps</p>



<a name="233235861"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233235861" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233235861">(Apr 05 2021 at 21:46)</a>:</h4>
<p>definitely seeing some 500-900ms queries with trace logs enabled, though it's not obvious why</p>



<a name="233235922"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233235922" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233235922">(Apr 05 2021 at 21:46)</a>:</h4>
<p><code>SELECT * FROM experiments INNER JOIN experiment_crates ON experiment_crates.experiment = experiments.name WHERE experiment_crates.assigned_to = ?1 AND experiment_crates.status = ?2 AND experiments.status = ?2 AND experiment_crates.skipped = 0 LIMIT 1</code> seems particularly slow</p>
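<p>One way to look at a query like this offline is SQLite's <code>EXPLAIN QUERY PLAN</code>. The sketch below uses Python's built-in <code>sqlite3</code> module against an in-memory database. The table layout, the sample rows and the index <code>idx_crates_assigned</code> are all assumptions made up for illustration; crater's real schema and indexes may differ, and nothing here shows that a missing index was the actual cause of the slowness.</p>

```python
import sqlite3

# The query from the message, with the same numbered placeholders.
QUERY = """
SELECT * FROM experiments
INNER JOIN experiment_crates
    ON experiment_crates.experiment = experiments.name
WHERE experiment_crates.assigned_to = ?1
  AND experiment_crates.status = ?2
  AND experiments.status = ?2
  AND experiment_crates.skipped = 0
LIMIT 1
"""


def build_db():
    # Hypothetical, simplified schema: only the columns the query mentions.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE experiments (name TEXT PRIMARY KEY, status TEXT);
        CREATE TABLE experiment_crates (
            experiment TEXT, crate TEXT, status TEXT,
            assigned_to TEXT, skipped INTEGER
        );
        """
    )
    conn.execute("INSERT INTO experiments VALUES ('pr-1', 'running')")
    rows = [
        ("pr-1", f"crate-{i}", "running" if i % 2 else "queued", f"agent-{i % 4}", 0)
        for i in range(1000)
    ]
    conn.executemany("INSERT INTO experiment_crates VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def query_plan(conn, params):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, params)]


def add_hypothetical_index(conn):
    # Illustrative only: an index covering the filtered columns.
    conn.execute(
        "CREATE INDEX idx_crates_assigned "
        "ON experiment_crates (assigned_to, status, skipped)"
    )


conn = build_db()
params = ("agent-1", "running")
print(query_plan(conn, params))   # plan before the index
add_hypothetical_index(conn)
print(query_plan(conn, params))   # plan after; the planner may or may not use it
```

<p>Comparing the two printed plans shows whether SQLite scans <code>experiment_crates</code> or searches it through an index for this particular data set.</p>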



<a name="233237695"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233237695" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233237695">(Apr 05 2021 at 22:03)</a>:</h4>
<p>ok, I've disabled metrics collection for now, and restarted the 4 crater instances as well as report generation</p>



<a name="233237713"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233237713" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233237713">(Apr 05 2021 at 22:04)</a>:</h4>
<p>things seem to be largely healthy</p>



<a name="233237786"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233237786" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233237786">(Apr 05 2021 at 22:04)</a>:</h4>
<p>going to see how cpu usage does now that we're not hitting the metrics endpoint constantly</p>



<a name="233238683"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233238683" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233238683">(Apr 05 2021 at 22:13)</a>:</h4>
<p>so far it seems to be doing well, will check back in an hour (well, actually, I'll sit here staring at graphs in all probability because I want to refresh constantly now)</p>



<a name="233240089"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233240089" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233240089">(Apr 05 2021 at 22:28)</a>:</h4>
<p>unfortunately this means I didn't get to investigating the crater stoppage (i.e. when crater agents, not the webserver, fail), but my hope is that I can dedicate time to this soon.</p>



<a name="233253008"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233253008" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233253008">(Apr 06 2021 at 01:16)</a>:</h4>
<p><a href="/user_uploads/4715/YLmW8_8jqJiWJKNb8WtwGn_p/image.png">image.png</a></p>
<div class="message_inline_image"><a href="/user_uploads/4715/YLmW8_8jqJiWJKNb8WtwGn_p/image.png" title="image.png"><img src="/user_uploads/4715/YLmW8_8jqJiWJKNb8WtwGn_p/image.png"></a></div>



<a name="233253036"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233253036" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233253036">(Apr 06 2021 at 01:17)</a>:</h4>
<p>Looking much better after monitoring was turned off; still some spikes in CPU usage which are concerning, but overall no longer seeing the near-100% constant usage. I suspect the spikes might be the zulip archiving.</p>



<a name="233253109"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233253109" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Jake Goulding <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233253109">(Apr 06 2021 at 01:18)</a>:</h4>
<p>Monitoring became what it swore to destroy...</p>



<a name="233253160"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233253160" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233253160">(Apr 06 2021 at 01:20)</a>:</h4>
<p>FWIW, I'm pretty sure that the monitoring lacks some kind of exponential backoff or whatever, so when queries started failing due to high load (around the 2pm mark in that graph) I suspect it was just disconnecting before actually getting a response back - I saw 499 status in the logs</p>



<a name="233253212"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233253212" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233253212">(Apr 06 2021 at 01:20)</a>:</h4>
<p>(and likely retrying sooner than might be desired, but I don't know about that)</p>
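<p>A minimal sketch of the kind of exponential backoff mentioned above, assuming a client that retries a failing request. The names <code>backoff_delay</code> and <code>retry_with_backoff</code> are hypothetical and are not part of Prometheus or crater; the thread does not establish how the monitoring actually retries.</p>

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper for illustration only.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // `checked_shl` returns None once the shift reaches 32 bits, so saturate instead.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // If the multiplication overflows, fall back to the cap.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping for an exponentially growing delay between failed calls.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

<p>With a 1s base and a 60s cap the delays would be 1s, 2s, 4s, 8s and so on up to 60s, so a server that is already overloaded sees fewer requests rather than a steady stream of retries.</p>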



<a name="233295694"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233295694" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233295694">(Apr 06 2021 at 10:15)</a>:</h4>
<p>catching up with the thread</p>



<a name="233295785"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233295785" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233295785">(Apr 06 2021 at 10:16)</a>:</h4>
<p><span class="user-mention silent" data-user-id="116122">simulacrum</span> <a href="#narrow/stream/242791-t-infra/topic/crater.20issues/near/233253160">said</a>:</p>
<blockquote>
<p>FWIW, I'm pretty sure that the monitoring lacks some kind of exponential backoff or whatever, so when queries started failing due to high load (around the 2pm mark in that graph) I suspect it was just disconnecting before actually getting a response back - I saw 499 status in the logs</p>
</blockquote>
<p>prometheus is currently configured to scrape the metrics every 5 seconds</p>



<a name="233295885"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233295885" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233295885">(Apr 06 2021 at 10:17)</a>:</h4>
<p>we could add a <code>scrape_interval: 30s</code> or even <code>scrape_interval: 1m</code> to just <a href="https://github.com/rust-lang/simpleinfra/blob/fefff4d492c02388091c543ef8921c4fd98e0fb0/ansible/playbooks/monitoring.yml#L76">the crater monitoring job</a></p>



<a name="233296001"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233296001" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233296001">(Apr 06 2021 at 10:18)</a>:</h4>
<p><span class="user-mention silent" data-user-id="116122">simulacrum</span> <a href="#narrow/stream/242791-t-infra/topic/crater.20issues/near/233221952">said</a>:</p>
<blockquote>
<p><span class="user-mention silent" data-user-id="121055">Pietro Albini</span> can you comment on your availability for reviews etc on crater? It feels important to me to get it to a point where we don't need to baby sit it as much, and depending on if you're available I can see us doing that via code improvement (e.g., trying to upgrade tokio, which likely has some code changes required) or by resizing the machine things run on, if not</p>
</blockquote>
<p>I can definitely review crater PRs unless they're absolutely huge to review</p>



<a name="233296368"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233296368" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233296368">(Apr 06 2021 at 10:22)</a>:</h4>
<p>for the slowness and crashes in generating the reports, my understanding is that it's due to a couple of issues:</p>
<ul>
<li>we generate the log archives and <code>full.html</code> loading everything in memory, which can OOM if we have too many things to put there</li>
<li>uploading to s3 is currently single-threaded, we don't do concurrent uploads</li>
</ul>
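<p>The two points above can be sketched roughly as follows. This is an illustration under stated assumptions, not crater's report code: <code>write_logs_streaming</code> and <code>upload_all</code> are hypothetical names, the <code>upload</code> callback stands in for whatever S3 client call is used, and the iterator of logs is assumed to produce entries lazily.</p>

```rust
use std::io::{self, Write};
use std::sync::Mutex;
use std::thread;

/// Writes each log to `out` as soon as the iterator yields it, instead of
/// collecting every log first. With a lazy iterator only one log is held
/// in memory at a time. Returns the total number of log bytes written.
fn write_logs_streaming<W: Write>(
    logs: impl Iterator<Item = io::Result<(String, Vec<u8>)>>,
    out: &mut W,
) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in logs {
        let (name, bytes) = entry?;
        writeln!(out, "== {} ({} bytes) ==", name, bytes.len())?;
        out.write_all(&bytes)?;
        out.write_all(b"\n")?;
        total += bytes.len() as u64;
    }
    Ok(total)
}

/// Uploads `files` using `workers` threads that pull from a shared queue.
/// `upload` is a hypothetical stand-in for the real upload call.
fn upload_all<F>(
    files: Vec<String>,
    workers: usize,
    upload: F,
) -> Vec<(String, Result<(), String>)>
where
    F: Fn(&str) -> Result<(), String> + Sync,
{
    let queue = Mutex::new(files.into_iter());
    let results = Mutex::new(Vec::new());
    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            s.spawn(|| loop {
                // The lock guard is dropped at the end of this statement,
                // so the queue is not held locked during the upload itself.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(file) => {
                        let outcome = upload(&file);
                        results.lock().unwrap().push((file, outcome));
                    }
                    None => break,
                }
            });
        }
    });
    results.into_inner().unwrap()
}
```

<p>The results come back in completion order rather than input order, so a caller that needs a stable order would have to sort them by file name.</p>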



<a name="233296428"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233296428" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233296428">(Apr 06 2021 at 10:23)</a>:</h4>
<p>I think the easiest temporary fix for the reports is to just discard all the logs for the categories not included in <code>summary.html</code></p>



<a name="233296518"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233296518" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233296518">(Apr 06 2021 at 10:24)</a>:</h4>
<p>actually, let me dump all this info in github issues</p>



<a name="233302538"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233302538" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233302538">(Apr 06 2021 at 11:28)</a>:</h4>
<p><span class="user-mention" data-user-id="116122">@simulacrum</span> did a dump of the issues I see in <a href="https://github.com/rust-lang/crater/labels/reliability">https://github.com/rust-lang/crater/labels/reliability</a></p>



<a name="233302586"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233302586" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233302586">(Apr 06 2021 at 11:28)</a>:</h4>
<p>need to eat something, will check back later</p>



<a name="233317803"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233317803" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233317803">(Apr 06 2021 at 13:25)</a>:</h4>
<p><span class="user-mention silent" data-user-id="121055">Pietro Albini</span> <a href="#narrow/stream/242791-t-infra/topic/crater.20issues/near/233295885">said</a>:</p>
<blockquote>
<p>we could add a <code>scrape_interval: 30s</code> or even <code>scrape_interval: 1m</code> to just <a href="https://github.com/rust-lang/simpleinfra/blob/fefff4d492c02388091c543ef8921c4fd98e0fb0/ansible/playbooks/monitoring.yml#L76">the crater monitoring job</a></p>
</blockquote>
<p>I think for now I'd like to hold off on adding this -- we should look at the queries needed to fulfill that request, as they're also used for e.g. the agents page, and optimize those. I don't think the metrics are super critical at this juncture.</p>



<a name="233318013"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233318013" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233318013">(Apr 06 2021 at 13:27)</a>:</h4>
<p>For now I'm guessing that we should be stable-ish as is. I primarily want to get the crater agent stalls fixed, but whether that's by replacing the graph with something more minimal (or, well, more of a list) or by something more targeted, I think we'll see</p>



<a name="233330187"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233330187" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233330187">(Apr 06 2021 at 14:42)</a>:</h4>
<p><span class="user-mention silent" data-user-id="116122">simulacrum</span> <a href="#narrow/stream/242791-t-infra/topic/crater.20issues/near/233317803">said</a>:</p>
<blockquote>
<p><span class="user-mention silent" data-user-id="121055">Pietro Albini</span> <a href="#narrow/stream/242791-t-infra/topic/crater.20issues/near/233295885">said</a>:</p>
<blockquote>
<p>we could add a <code>scrape_interval: 30s</code> or even <code>scrape_interval: 1m</code> to just <a href="https://github.com/rust-lang/simpleinfra/blob/fefff4d492c02388091c543ef8921c4fd98e0fb0/ansible/playbooks/monitoring.yml#L76">the crater monitoring job</a></p>
</blockquote>
<p>I think for now I'd like to hold off on adding this -- we should look at the queries needed to fulfill that request, as they're also used for e.g. the agents page, and optimize those. I don't think the metrics are super critical at this juncture.</p>
</blockquote>
<p>well, the metrics are useful to get alerts when an agent deadlocks</p>



<a name="233330455"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233330455" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233330455">(Apr 06 2021 at 14:44)</a>:</h4>
<p>if we scrape them at a longer interval we'll be able to detect whether the agents deadlocked or not without impacting the server too much</p>



<a name="233334252"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233334252" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233334252">(Apr 06 2021 at 15:03)</a>:</h4>
<p>I think we could bump granularity to something like 5 minutes maybe</p>



<a name="233334294"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233334294" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233334294">(Apr 06 2021 at 15:03)</a>:</h4>
<p>at least one minute</p>



<a name="233341561"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233341561" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233341561">(Apr 06 2021 at 15:34)</a>:</h4>
<p>I think one minute is fine</p>



<a name="233341625"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233341625" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233341625">(Apr 06 2021 at 15:35)</a>:</h4>
<p>I'll enable those in a bit at 1 minute then</p>



<a name="233341691"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233341691" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233341691">(Apr 06 2021 at 15:35)</a>:</h4>
<p>thanks <span aria-label="heart" class="emoji emoji-2764" role="img" title="heart">:heart:</span></p>



<a name="233345415"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233345415" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233345415">(Apr 06 2021 at 15:59)</a>:</h4>
<p>seems to be doing ok:</p>
<p><a href="/user_uploads/4715/2k9se66bLbkd2dhHwSpBPBpe/image.png">image.png</a></p>
<div class="message_inline_image"><a href="/user_uploads/4715/2k9se66bLbkd2dhHwSpBPBpe/image.png" title="image.png"><img src="/user_uploads/4715/2k9se66bLbkd2dhHwSpBPBpe/image.png"></a></div>



<a name="233345457"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233345457" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233345457">(Apr 06 2021 at 15:59)</a>:</h4>
<p>but definitely not an insignificant addition to our load</p>



<a name="233489893"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233489893" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233489893">(Apr 07 2021 at 13:46)</a>:</h4>
<p><span class="user-mention" data-user-id="121055">@Pietro Albini</span> next time crater stalls out, can you ping me and not reboot it? I'd like to try to capture a backtrace and such</p>



<a name="233489920"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233489920" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233489920">(Apr 07 2021 at 13:46)</a>:</h4>
<p>sure!</p>



<a name="233853414"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853414" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853414">(Apr 09 2021 at 16:34)</a>:</h4>
<p><span class="user-mention" data-user-id="121055">@Pietro Albini</span> okay, gcp-1 is stalled out right now, and I think I found the bug - <a href="https://github.com/rust-lang/crater/pull/569">https://github.com/rust-lang/crater/pull/569</a></p>



<a name="233853475"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853475" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853475">(Apr 09 2021 at 16:34)</a>:</h4>
<p>I can restart it but was going to ping in case you wanted to investigate as well, I think that PR should hopefully fix at least one case of stalled workers though.</p>



<a name="233853546"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853546" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853546">(Apr 09 2021 at 16:35)</a>:</h4>
<p>(to avoid slowing down our queue since it's pretty long and since this happens not infrequently I'll restart in an hour or two regardless)</p>



<a name="233853774"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853774" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853774">(Apr 09 2021 at 16:36)</a>:</h4>
<p>please restart, I'll be able to take a look at that PR later today or early tomorrow</p>



<a name="233853804"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853804" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853804">(Apr 09 2021 at 16:36)</a>:</h4>
<p>thanks for drilling down into this <span aria-label="heart" class="emoji emoji-2764" role="img" title="heart">:heart:</span></p>



<a name="233853842"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233853842" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233853842">(Apr 09 2021 at 16:37)</a>:</h4>
<p>sounds good</p>



<a name="233958360"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233958360" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> lqd <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233958360">(Apr 10 2021 at 14:15)</a>:</h4>
<p>it seems <a href="https://crater.rust-lang.org/">https://crater.rust-lang.org/</a> is showing 502s again</p>



<a name="233965195"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233965195" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233965195">(Apr 10 2021 at 15:35)</a>:</h4>
<p>yeah, fixing it</p>



<a name="233965760"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/233965760" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> lqd <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#233965760">(Apr 10 2021 at 15:44)</a>:</h4>
<p>awesome, thanks a bunch :)</p>



<a name="234049881"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234049881" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234049881">(Apr 11 2021 at 14:38)</a>:</h4>
<p>ok, seems like there's another deadlock - crater-azure-2 has my patch but stalled out on all workers in the blocked state as far as I can tell</p>



<a name="234050587"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234050587" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234050587">(Apr 11 2021 at 14:48)</a>:</h4>
<p>trying to investigate before restarting</p>



<a name="234052145"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234052145" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234052145">(Apr 11 2021 at 15:02)</a>:</h4>
<div class="codehilite"><pre><span></span><code>ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::runner::worker] task failed, marking childs as failed too: doc beta-2021-03-27 of crate sheesy-cli-4.0.11
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::utils] No such file or directory (os error 2)
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::utils] note: run with `RUST_BACKTRACE=1` to display a backtrace.
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z INFO  rustwide::cmd] [stdout] 26f27f650d6db6f9a05caab6fb26b2b64aa53b47709d67f161ee63d189a5bd3b
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z TRACE crater::runner::graph] worker-11 | NodeIndex(3754) prevented recursive mark_as_failed as it has other parents
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z DEBUG crater::runner::graph] marking task doc beta-2021-03-27 of crate sheesy-cli-4.0.11 as failed
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::runner::tasks] this task or one of its parent failed!
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::utils] No such file or directory (os error 2)
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::utils] note: run with `RUST_BACKTRACE=1` to display a backtrace.
ure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z INFO  crater::agent::results] sending results to the crater server...
</code></pre></div>



<a name="234052196"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234052196" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234052196">(Apr 11 2021 at 15:02)</a>:</h4>
<p>is the interesting piece of the log</p>



<a name="234052242"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234052242" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234052242">(Apr 11 2021 at 15:03)</a>:</h4>
<p>as far as I can tell, that unit in the task graph was not marked as completed by this line <a href="https://github.com/rust-lang/crater/blob/e8f8ace1476107d4bd01df73f0645bba1b2451c0/src/runner/graph.rs#L241">https://github.com/rust-lang/crater/blob/e8f8ace1476107d4bd01df73f0645bba1b2451c0/src/runner/graph.rs#L241</a></p>



<a name="234052545"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234052545" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234052545">(Apr 11 2021 at 15:04)</a>:</h4>
<p>but since the log shows the "marking task ... as failed" line, that implies that we early-exited from that function via the <code>?</code> on <code>task.mark_as_failed</code></p>
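<p>A minimal sketch of the failure mode being theorized here, assuming a simplified graph that tracks completed nodes in a set. <code>Graph</code>, <code>report_failure</code>, and both methods are hypothetical and much simpler than crater's <code>src/runner/graph.rs</code>; this shows only how a <code>?</code> can return before the bookkeeping runs, and is not the actual bug or its fix.</p>

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the task graph: just the set of completed nodes.
#[derive(Default)]
struct Graph {
    completed: HashSet<u32>,
}

/// Hypothetical stand-in for recording a failed task. It can itself fail,
/// for example with an I/O error like the one in the log above.
fn report_failure(node: u32, report_ok: bool) -> Result<(), String> {
    if report_ok {
        Ok(())
    } else {
        Err(format!("node {}: No such file or directory", node))
    }
}

impl Graph {
    /// Early-exit shape: if `report_failure` errors, `?` returns immediately
    /// and the node is never marked completed, so anything waiting on it
    /// would wait forever.
    fn mark_as_failed_early_exit(&mut self, node: u32, report_ok: bool) -> Result<(), String> {
        report_failure(node, report_ok)?;
        self.completed.insert(node);
        Ok(())
    }

    /// One possible alternative: do the bookkeeping unconditionally and
    /// propagate the error afterwards.
    fn mark_as_failed_always_completes(&mut self, node: u32, report_ok: bool) -> Result<(), String> {
        let outcome = report_failure(node, report_ok);
        self.completed.insert(node);
        outcome
    }
}
```

<p>In the first method an error leaves the node out of <code>completed</code>; in the second the caller still sees the error but the node is marked, which is the property a stalled worker would depend on.</p>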



<a name="234052558"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234052558" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234052558">(Apr 11 2021 at 15:04)</a>:</h4>
<p>I'm trying to verify that theory now</p>



<a name="234053008"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234053008" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234053008">(Apr 11 2021 at 15:08)</a>:</h4>
<p>according to the logs on the coordination server, two record-progress requests at 03:50:28Z came in from this machine; the first took 3.648 seconds due to a concurrent metrics request, and the second took 0.113 seconds (if I'm interpreting the log right). The returned status on both is 200.</p>



<a name="234053039"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234053039" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234053039">(Apr 11 2021 at 15:09)</a>:</h4>
<p>I'm not sure why there are two requests though, as the log on the crater machine seems to only indicate one</p>



<a name="234053761"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234053761" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234053761">(Apr 11 2021 at 15:19)</a>:</h4>
<p>the database doesn't seem to have a record of these requests, at least as far as I can tell:</p>
<div class="codehilite"><pre><span></span><code>sqlite&gt; select * from results where experiment = &#39;beta-1.52-rustdoc-1&#39; and crate like &#39;%sheesy-cli%&#39;;
experiment,crate,toolchain,result,log,encoding
beta-1.52-rustdoc-1,reg/sheesy-cli/4.0.11,1.51.0,error,&quot;,gzip
</code></pre></div>



<a name="234053783"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234053783" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234053783">(Apr 11 2021 at 15:19)</a>:</h4>
<p>(that's the run of this crate on the "before" toolchain, 1.51.0, not the beta-... version seen failing above)</p>



<a name="234054634"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234054634" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234054634">(Apr 11 2021 at 15:30)</a>:</h4>
<p>I guess I should look at the API endpoint to see if it can return 200 while failing</p>



<a name="234054860"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234054860" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234054860">(Apr 11 2021 at 15:33)</a>:</h4>
<p>er, I misread the HTTP log - there were <em>3</em> requests, not 2, from the azure-2 machine</p>



<a name="234054870"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234054870" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234054870">(Apr 11 2021 at 15:33)</a>:</h4>
<p>and the crater server log also has the same</p>



<a name="234054977"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234054977" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234054977">(Apr 11 2021 at 15:34)</a>:</h4>
<p>which is pretty weird</p>



<a name="234055293"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055293" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055293">(Apr 11 2021 at 15:39)</a>:</h4>
<p>aha, I have an idea now</p>



<a name="234055301"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055301" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055301">(Apr 11 2021 at 15:39)</a>:</h4>
<p>the timestamp in the nginx logs is the <em>end</em> of the request most likely</p>



<a name="234055389"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055389" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055389">(Apr 11 2021 at 15:40)</a>:</h4>
<p>and indeed both of the first two took ~3 seconds to process, so they're unrelated to this, likely issued ~3 seconds earlier</p>



<a name="234055413"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055413" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055413">(Apr 11 2021 at 15:40)</a>:</h4>
<p>and the agent log has two progress reports sent at :24, so those are probably these</p>
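<p>A minimal sketch of the arithmetic behind that guess, assuming the nginx timestamp marks the end of the request and the logged duration covers the whole request. The helper name <code>request_start</code> is hypothetical, the 3.648 second duration is the one quoted earlier in the thread, and the log timestamp only has second resolution, so the result is an estimate.</p>
<div class="codehilite" data-code-language="Python"><pre><span></span><code>from datetime import datetime, timedelta, timezone

# Assumption: the access log stamps each line when the response finishes,
# so the request began roughly one request duration before that stamp.
def request_start(logged_at, duration_seconds):
    """Estimate when a request began from its end-of-request log timestamp."""
    return logged_at - timedelta(seconds=duration_seconds)

# Values quoted in the thread: logged at 03:50:28Z, duration 3.648 s.
logged_at = datetime(2021, 4, 11, 3, 50, 28, tzinfo=timezone.utc)
slow_start = request_start(logged_at, 3.648)

print(slow_start.strftime("%H:%M:%S"))  # 03:50:24
</code></pre></div>
<p>Under those assumptions the slow request would have started around :24, which lines up with the progress reports in the agent log.</p>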



<a name="234055424"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055424" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055424">(Apr 11 2021 at 15:41)</a>:</h4>
<p>ok, so that means we sent a single progress report, no retries</p>



<a name="234055599"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234055599" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234055599">(Apr 11 2021 at 15:43)</a>:</h4>
<p>there's a bunch of early exits but I don't really know if they could be responsible for the failure to record the result in the db, or maybe I'm even reading the db wrong</p>



<a name="234056555"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056555" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056555">(Apr 11 2021 at 15:56)</a>:</h4>
<p>ok, so we think the crate is running</p>



<a name="234056652"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056652" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056652">(Apr 11 2021 at 15:58)</a>:</h4>
<div class="codehilite"><pre><span></span><code>sqlite&gt; select * from experiment_crates where experiment = &#39;beta-1.52-rustdoc-1&#39; and crate like &#39;%sheesy-cli%&#39;;
experiment,crate,skipped,status,assigned_to
beta-1.52-rustdoc-1,reg/sheesy-cli/4.0.11,0,running,agent:azure-2
</code></pre></div>



<a name="234056674"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056674" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056674">(Apr 11 2021 at 15:58)</a>:</h4>
<p>I'm going to see if I can find the other half of the crate in the agent logs</p>



<a name="234056705"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056705" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056705">(Apr 11 2021 at 15:59)</a>:</h4>
<div class="codehilite"><pre><span></span><code>Apr 11 03:50:28 crater-azure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z ERROR crater::runner::worker] task failed, marking childs as failed too: doc beta-2021-03-27 of crate sheesy-cli-4.0.11
Apr 11 03:50:28 crater-azure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:28Z DEBUG crater::runner::graph] marking task doc beta-2021-03-27 of crate sheesy-cli-4.0.11 as failed
Apr 11 03:50:30 crater-azure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:30Z ERROR crater::runner::worker] task failed, marking childs as failed too: doc 1.51.0 of crate sheesy-cli-4.0.11
Apr 11 03:50:30 crater-azure-2.infra.rust-lang.org docker[57987]: [2021-04-11T03:50:30Z DEBUG crater::runner::graph] marking task doc 1.51.0 of crate sheesy-cli-4.0.11 as failed
</code></pre></div>



<a name="234056708"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056708" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056708">(Apr 11 2021 at 15:59)</a>:</h4>
<p>ok so both parts failed</p>



<a name="234056903"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056903" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056903">(Apr 11 2021 at 16:02)</a>:</h4>
<p>and we seem to have the http request on the nginx and app logs on the server</p>



<a name="234056909"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234056909" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234056909">(Apr 11 2021 at 16:02)</a>:</h4>
<p>but neither has recorded a result?</p>



<a name="234057201"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234057201" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234057201">(Apr 11 2021 at 16:07)</a>:</h4>
<p>oh, I guess no, the second one did</p>



<a name="234057211"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234057211" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234057211">(Apr 11 2021 at 16:07)</a>:</h4>
<p>but the first one didn't</p>



<a name="234057261"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234057261" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234057261">(Apr 11 2021 at 16:08)</a>:</h4>
<p>and I guess the second one is not present in the later logs, so it presumably successfully deleted itself from the task graph as well</p>



<a name="234057988"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234057988" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234057988">(Apr 11 2021 at 16:21)</a>:</h4>
<p>oh, I have a thought - if I get the threads to exit, the logs should get the error message logged</p>



<a name="234058108"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234058108" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234058108">(Apr 11 2021 at 16:23)</a>:</h4>
<p>hm ok I think I failed to do that</p>



<a name="234058181"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234058181" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234058181">(Apr 11 2021 at 16:24)</a>:</h4>
<p>oh well</p>



<a name="234058188"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234058188" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234058188">(Apr 11 2021 at 16:24)</a>:</h4>
<p>anyway, restarted azure-2, will post a PR shortly that should help with this</p>



<a name="234059471"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234059471" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234059471">(Apr 11 2021 at 16:46)</a>:</h4>
<p><a href="https://github.com/rust-lang/crater/pull/570">https://github.com/rust-lang/crater/pull/570</a>, also adds some logging to help track down the cause (which I still don't know)</p>



<a name="234070712"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234070712" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234070712">(Apr 11 2021 at 19:47)</a>:</h4>
<p><span class="user-mention" data-user-id="121055">@Pietro Albini</span> do you mind if I deploy the logging change (in the second commit in this case, but also more generally in the future to help debug future cases) without review by you?</p>



<a name="234070909"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234070909" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234070909">(Apr 11 2021 at 19:50)</a>:</h4>
<p><span class="user-mention" data-user-id="116122">@simulacrum</span> thanks for the ping, approved</p>



<a name="234508794"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234508794" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234508794">(Apr 14 2021 at 14:33)</a>:</h4>
<p>FWIW I have not seen further stalls after these two bugs were fixed</p>



<a name="234508818"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234508818" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234508818">(Apr 14 2021 at 14:33)</a>:</h4>
<p>which seems positive!</p>



<a name="234508915"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234508915" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234508915">(Apr 14 2021 at 14:34)</a>:</h4>
<p>otoh, I'm a bit surprised, as my guess is we didn't fix <em>everything</em></p>



<a name="234508998"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/234508998" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#234508998">(Apr 14 2021 at 14:34)</a>:</h4>
<p>so I think the old policy of "don't restart if it goes down" should continue to be the case <span class="user-mention" data-user-id="121055">@Pietro Albini</span> -- I want to iron out any remaining bugs as they arise</p>



<a name="238182067"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238182067" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238182067">(May 10 2021 at 17:38)</a>:</h4>
<p>gcp-1 and gcp-2 are marked as unreachable</p>



<a name="238182230"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238182230" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238182230">(May 10 2021 at 17:39)</a>:</h4>
<p>Known, thanks - it'll likely be some time before they're brought back up at this point</p>



<a name="238182771"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238182771" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Jake Goulding <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238182771">(May 10 2021 at 17:42)</a>:</h4>
<p>One <em>could</em> say they... cratered</p>



<a name="238608920"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238608920" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238608920">(May 13 2021 at 11:24)</a>:</h4>
<p>The job <a href="https://crater.rust-lang.org/ex/pr-84920">https://crater.rust-lang.org/ex/pr-84920</a> has no agents assigned, but isn't complete</p>



<a name="238618777"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238618777" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238618777">(May 13 2021 at 13:13)</a>:</h4>
<p>hmm</p>



<a name="238618788"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238618788" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238618788">(May 13 2021 at 13:13)</a>:</h4>
<p>I wonder if that's because of the crater stuff</p>



<a name="238618802"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238618802" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238618802">(May 13 2021 at 13:13)</a>:</h4>
<p>let me see if prioritizing that run will let it finish up</p>



<a name="238619068"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619068" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619068">(May 13 2021 at 13:15)</a>:</h4>
<p>ok, prioritized, will check back in later</p>



<a name="238619084"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619084" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619084">(May 13 2021 at 13:15)</a>:</h4>
<p>oh I think I know why this is happening</p>



<a name="238619114"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619114" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619114">(May 13 2021 at 13:15)</a>:</h4>
<p>some chunks of the distributed experiments are still assigned to the gcp agents</p>



<a name="238619131"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619131" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619131">(May 13 2021 at 13:16)</a>:</h4>
<p>oh, perhaps</p>



<a name="238619577"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619577" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619577">(May 13 2021 at 13:20)</a>:</h4>
<p>yep</p>



<a name="238619590"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619590" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619590">(May 13 2021 at 13:20)</a>:</h4>
<div class="codehilite"><pre><span></span><code>sqlite&gt; select assigned_to, count(*) from experiment_crates where status = &#39;running&#39; group by assigned_to;
agent:azure-1|12
agent:azure-2|299
agent:gcp-1|324
agent:gcp-2|954
sqlite&gt;
</code></pre></div>



<a name="238619742"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619742" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619742">(May 13 2021 at 13:21)</a>:</h4>
<p>ran</p>
<div class="codehilite" data-code-language="SQL"><pre><span></span><code><span class="k">update</span> <span class="n">experiment_crates</span> <span class="k">set</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'queued'</span><span class="p">,</span> <span class="n">assigned_to</span> <span class="o">=</span> <span class="k">null</span> <span class="k">where</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'running'</span> <span class="k">and</span> <span class="n">assigned_to</span> <span class="o">=</span> <span class="s1">'agent:gcp-1'</span><span class="p">;</span>
<span class="k">update</span> <span class="n">experiment_crates</span> <span class="k">set</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'queued'</span><span class="p">,</span> <span class="n">assigned_to</span> <span class="o">=</span> <span class="k">null</span> <span class="k">where</span> <span class="n">status</span> <span class="o">=</span> <span class="s1">'running'</span> <span class="k">and</span> <span class="n">assigned_to</span> <span class="o">=</span> <span class="s1">'agent:gcp-2'</span><span class="p">;</span>
</code></pre></div>
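<p>A rough sketch of the same requeue as a single parameterized statement, run against an in-memory stand-in rather than the real coordinator database. The column names follow the query output quoted earlier in the thread; the column types, the sample rows, and the <code>requeue</code> helper are assumptions made for illustration, not part of crater.</p>
<div class="codehilite" data-code-language="Python"><pre><span></span><code>import sqlite3

# Hypothetical stand-in for the coordinator database. Column names come from
# the query output in the thread; the types and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table experiment_crates ("
    "experiment text, crate text, skipped integer, status text, assigned_to text)"
)
conn.executemany(
    "insert into experiment_crates values (?, ?, 0, ?, ?)",
    [
        ("pr-84920", "reg/a/1.0.0", "running", "agent:gcp-1"),
        ("pr-84920", "reg/b/1.0.0", "running", "agent:gcp-2"),
        ("beta-1.53-1", "reg/c/1.0.0", "running", "agent:azure-1"),
    ],
)

def requeue(conn, agents):
    """Return crates still marked running on the given agents to the queue."""
    placeholders = ", ".join("?" for _ in agents)
    with conn:  # commits on success, rolls back if the update raises
        cur = conn.execute(
            "update experiment_crates set status = 'queued', assigned_to = null "
            "where status = 'running' and assigned_to in (" + placeholders + ")",
            list(agents),
        )
    return cur.rowcount

requeued = requeue(conn, ["agent:gcp-1", "agent:gcp-2"])
remaining = conn.execute(
    "select assigned_to, count(*) from experiment_crates "
    "where status = 'running' group by assigned_to"
).fetchall()
print(requeued, remaining)
</code></pre></div>
<p>Passing the agent names as parameters keeps the two updates in one transaction, so either both agents' crates are requeued or neither is.</p>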



<a name="238619764"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619764" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619764">(May 13 2021 at 13:21)</a>:</h4>
<p>hopefully this will fix it</p>



<a name="238619833"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238619833" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238619833">(May 13 2021 at 13:21)</a>:</h4>
<p>the next time one of the azure agents finishes a chunk they should pick some of those crates up</p>



<a name="238620091"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620091" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620091">(May 13 2021 at 13:23)</a>:</h4>
<p>btw <span class="user-mention" data-user-id="125294">@Aaron Hill</span>, could we keep a single topic for crater issues to keep the topics manageable?</p>



<a name="238620146"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620146" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620146">(May 13 2021 at 13:24)</a>:</h4>
<p>Sure :)</p>



<a name="238620162"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620162" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620162">(May 13 2021 at 13:24)</a>:</h4>
<p>thanks <span aria-label="heart" class="emoji emoji-2764" role="img" title="heart">:heart:</span></p>



<a name="238620180"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620180" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620180">(May 13 2021 at 13:24)</a>:</h4>
<div class="codehilite"><pre><span></span><code>sqlite&gt; select experiment, assigned_to, count(*) from experiment_crates where status = &#39;running&#39; group by assigned_to, experiment;
beta-1.53-1|agent:azure-1|1013
beta-1.53-1|agent:azure-2|251
sqlite&gt;
</code></pre></div>



<a name="238620193"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620193" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620193">(May 13 2021 at 13:24)</a>:</h4>
<p>current status</p>



<a name="238620238"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620238" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620238">(May 13 2021 at 13:24)</a>:</h4>
<p>once azure-2 goes to zero it should pick up the older run</p>



<a name="238620321"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620321" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620321">(May 13 2021 at 13:25)</a>:</h4>
<p>I'll file an issue so we don't hit this in the future (we should deschedule things from unreachable nodes or something)</p>



<a name="238620379"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620379" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620379">(May 13 2021 at 13:25)</a>:</h4>
<p>definitely, thanks for doing so mark!</p>



<a name="238620634"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238620634" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238620634">(May 13 2021 at 13:28)</a>:</h4>
<p><a href="https://github.com/rust-lang/crater/issues/577">https://github.com/rust-lang/crater/issues/577</a></p>



<a name="238628498"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238628498" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238628498">(May 13 2021 at 14:27)</a>:</h4>
<div class="codehilite"><pre><span></span><code>sqlite&gt; select experiment, assigned_to, count(*) from experiment_crates where status = &#39;running&#39; group by assigned_to, experiment;
beta-1.53-1|agent:azure-1|297
pr-84920|agent:azure-2|712
sqlite&gt;
</code></pre></div>
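<p>For readers who want to try the query above outside the production database, here is a minimal sketch using Python's built-in <code>sqlite3</code> module. The table name, column names, and the <code>running</code> status come from the query as pasted; the in-memory table, the sample rows, the <code>queued</code> status, the helper name <code>running_crates_by_agent</code>, and the added <code>ORDER BY</code> (for stable output) are assumptions made for illustration, not Crater's actual schema or tooling.</p>

```python
import sqlite3


def running_crates_by_agent(conn):
    """Count crates in the 'running' state per (experiment, agent) pair."""
    query = """
        SELECT experiment, assigned_to, COUNT(*)
        FROM experiment_crates
        WHERE status = 'running'
        GROUP BY assigned_to, experiment
        ORDER BY assigned_to, experiment  -- added here only for stable output
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory table holding only the three columns the query uses.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment_crates (experiment TEXT, assigned_to TEXT, status TEXT)"
)
sample_rows = [
    ("beta-1.53-1", "agent:azure-1", "running"),
    ("beta-1.53-1", "agent:azure-1", "running"),
    ("pr-84920", "agent:azure-2", "running"),
    ("pr-84920", "agent:azure-2", "queued"),  # not counted: status is not 'running'
]
conn.executemany("INSERT INTO experiment_crates VALUES (?, ?, ?)", sample_rows)

# Print in the same pipe-separated layout the sqlite shell uses by default.
for experiment, agent, count in running_crates_by_agent(conn):
    print(f"{experiment}|{agent}|{count}")
```

<p>Each output row is one (experiment, agent) pair with the number of crates that agent currently holds in the running state, which is how the pasted output shows one experiment per agent.</p>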



<a name="238628573"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/238628573" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#238628573">(May 13 2021 at 14:28)</a>:</h4>
<p>that seems to have worked <span class="user-mention" data-user-id="125294">@Aaron Hill</span>!</p>



<a name="239302247"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/239302247" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#239302247">(May 18 2021 at 18:25)</a>:</h4>
<p>How long are the two gcp agents expected to remain down for? I have a PR at the end of the queue, and at this rate, it's going to be a very long time before it gets a chance to run</p>



<a name="239303901"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/239303901" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> simulacrum <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#239303901">(May 18 2021 at 18:35)</a>:</h4>
<p>We don't currently have a timeline, but I suspect that we'll arrange for more capacity somehow if the queue gets (even) longer.</p>



<a name="240002477"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240002477" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Aaron Hill <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240002477">(May 24 2021 at 04:19)</a>:</h4>
<p>The queue now has 6 jobs, three of which are <code>cargo build</code> / <code>cargo test</code></p>



<a name="240080923"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240080923" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240080923">(May 24 2021 at 17:09)</a>:</h4>
<p>yeah :(</p>



<a name="240080937"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240080937" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240080937">(May 24 2021 at 17:09)</a>:</h4>
<p>we're figuring out what to do on this topic</p>



<a name="240445981"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240445981" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240445981">(May 27 2021 at 10:31)</a>:</h4>
<p><span class="user-mention" data-user-id="125294">@Aaron Hill</span> we temporarily added an aws agent while we figure out a more permanent solution <span aria-label="tada" class="emoji emoji-1f389" role="img" title="tada">:tada:</span></p>



<a name="240445990"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240445990" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240445990">(May 27 2021 at 10:31)</a>:</h4>
<p><a href="/user_uploads/4715/2TC-mqWgIWhF-oEp0eUnohGm/Screenshot-from-2021-05-27-12-30-31.png">Screenshot-from-2021-05-27-12-30-31.png</a></p>
<div class="message_inline_image"><a href="/user_uploads/4715/2TC-mqWgIWhF-oEp0eUnohGm/Screenshot-from-2021-05-27-12-30-31.png" title="Screenshot-from-2021-05-27-12-30-31.png"><img src="/user_uploads/4715/2TC-mqWgIWhF-oEp0eUnohGm/Screenshot-from-2021-05-27-12-30-31.png"></a></div>



<a name="240446110"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240446110" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240446110">(May 27 2021 at 10:32)</a>:</h4>
<p>note that the agent might go down for a bit as it's an aws spot instance</p>



<a name="240446136"></a>
<h4><a href="https://rust-lang.zulipchat.com#narrow/stream/242791-t-infra/topic/crater%20issues/near/240446136" class="zl"><img src="https://rust-lang.github.io/zulip_archive/assets/img/zulip.svg" alt="view this post on Zulip" style="width:20px;height:20px;"></a> Pietro Albini <a href="https://rust-lang.github.io/zulip_archive/stream/242791-t-infra/topic/crater.20issues.html#240446136">(May 27 2021 at 10:32)</a>:</h4>
<p>we have monitoring in place so we shouldn't need a ping when it's down :)</p>



<hr><p>Last updated: Aug 07 2021 at 22:04 UTC</p>
</html>